Word List (by frequency)
# | word (frequency) | phonetic | sentence |
1 | CTPN (33) | [!≈ si: ti: pi: en] |
- We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural images.
- The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional feature maps.
- This allows the CTPN to explore rich context information of the image, making it powerful in detecting extremely ambiguous text.
- The CTPN works reliably on multi-scale and multi-language text without further post-processing, departing from previous bottom-up methods that require multi-step post-filtering.
- The CTPN is computationally efficient at 0.14s/image, by using the very deep VGG16 model [27].
- We propose a novel Connectionist Text Proposal Network (CTPN) that directly localizes text sequences in convolutional layers.
- We leverage the advantages of strong deep convolutional features and the shared-computation mechanism, and propose the CTPN architecture described in Fig. 1.
- Fig. 1: (a) Architecture of the Connectionist Text Proposal Network (CTPN).
- (b) The CTPN outputs sequential fixed-width fine-scale text proposals.
- This section presents details of the Connectionist Text Proposal Network (CTPN).
- Similar to the Region Proposal Network (RPN) [25], the CTPN is essentially a fully convolutional network that allows an input image of arbitrary size.
- The architecture of the CTPN is presented in Fig. 1 (a).
- Fig. 3: Top: CTPN without RNN.
- Bottom: CTPN with RNN connection.
- The fine-scale text proposals are detected accurately and reliably by our CTPN.
- Fig. 4: CTPN detection with (red box) and without (yellow dashed box) the side-refinement.
- The proposed CTPN has three outputs which are jointly connected to the last FC layer, as shown in Fig. 1 (a) (a minimal sketch follows this entry).
- The CTPN can be trained end-to-end by using standard back-propagation and stochastic gradient descent (SGD).
- This is crucial for detecting small-scale text patterns, which is one of the key advantages of the CTPN.
- We evaluate the CTPN on five text detection benchmarks, namely the ICDAR 2011 [21], ICDAR 2013 [19], ICDAR 2015 [18], SWT [3], and Multilingual [24] datasets.
- We discuss the impact of the recurrent connection on our CTPN.
- It is of great importance for recovering highly ambiguous text (e.g., extremely small-scale ones), which is one of the main advantages of our CTPN, as demonstrated in Fig. 6.
- As shown in Table 1 (left), with our recurrent connection, the CTPN improves the FTPN substantially, from an F-measure of 0.80 to 0.88.
- Fig. 6: CTPN detection results on extremely small-scale cases (in red boxes), where some ground truth boxes are missed.
- The running time of our CTPN (for the whole detection process) is about 0.14s per image with a fixed short side of 600, using a single GPU.
- The CTPN without the RNN connection takes about 0.13s/image GPU time.
- As can be seen, the CTPN works perfectly on these challenging cases, some of which are difficult for many previous methods.
- Fig. 5: CTPN detection results on several challenging images, including multi-scale and multi-language text lines.
- As shown in Tables 1 and 2, our CTPN achieves the best performance on all five datasets.
- This may be due to the strong capability of the CTPN for detecting extremely challenging text, e.g., very small-scale text, some of which is even difficult for humans.
- We have presented the Connectionist Text Proposal Network (CTPN), an efficient text detector that is end-to-end trainable.
- The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional maps.
- The CTPN is efficient, achieving new state-of-the-art performance on five benchmarks with a 0.14s/image running time.
|
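A minimal sketch of the CTPN head described in the entry above, assuming PyTorch; the 3×3 convolution standing in for the sliding window, the layer names, and k = 10 are assumptions for illustration, not the authors' code:

```python
# Hypothetical sketch: 3x3 window on conv5 -> 256D BLSTM -> 512D FC ->
# three jointly connected outputs (2k scores, 2k y-coords, k side offsets).
import torch
import torch.nn as nn

class CTPNHead(nn.Module):
    def __init__(self, in_channels=512, k=10):
        super().__init__()
        self.window = nn.Conv2d(in_channels, in_channels, 3, padding=1)  # 3x3 sliding window
        self.blstm = nn.LSTM(in_channels, 128, bidirectional=True,
                             batch_first=True)             # two 128D LSTMs = 256D BLSTM
        self.fc = nn.Linear(256, 512)                      # 512D fully-connected layer
        self.score = nn.Linear(512, 2 * k)                 # text/non-text scores
        self.vcoord = nn.Linear(512, 2 * k)                # vertical coordinates
        self.side = nn.Linear(512, k)                      # side-refinement offsets

    def forward(self, conv5):                              # conv5: (N, C, H, W)
        x = torch.relu(self.window(conv5))
        n, c, h, w = x.shape
        x = x.permute(0, 2, 3, 1).reshape(n * h, w, c)     # one sequence per feature-map row
        x, _ = self.blstm(x)                               # (N*H, W, 256)
        x = torch.relu(self.fc(x))
        return self.score(x), self.vcoord(x), self.side(x)
```

For example, `CTPNHead()(torch.randn(1, 512, 38, 50))` returns the three outputs over every spatial location of a 38×50 conv5 map.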
2 | fine-scale (25) | [!≈ faɪn skeɪl] |
- The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional feature maps.
- Then, an in-network recurrent architecture is proposed to connect these fine-scale text proposals in sequences, allowing them to encode rich context information.
- (b) The CTPN outputs sequential fixed-width fine-scale text proposals.
- First, we cast the problem of text detection into localizing a sequence of fine-scale text proposals.
- It includes three key contributions that make it reliable and accurate for text localization: detecting text in fine-scale proposals, recurrent connectionist text proposals, and side-refinement.
- 3.1 Detecting Text in Fine-scale Proposals
- It detects a text line by densely sliding a small window in the convolutional feature maps, and outputs a sequence of fine-scale (e.g., fixed 16-pixel width) text proposals, as shown in Fig. 1 (b).
- Right: Fine-scale text proposals.
- It is natural to consider a text line as a sequence of fine-scale text proposals, where each proposal generally represents a small part of a text line, e.g., a text piece with a 16-pixel width.
- We develop a vertical anchor mechanism that simultaneously predicts a text/non-text score and the y-axis location of each fine-scale proposal (an anchor-generation sketch follows this entry).
- To this end, we design the fine-scale text proposal as follows.
- With the designed vertical anchors and fine-scale detection strategy, our detector is able to handle text lines over a wide range of scales and aspect ratios using a single-scale image.
- Compared to the RPN or Faster R-CNN system [25], our fine-scale detection provides more detailed supervised information that naturally leads to more accurate detection.
- To improve localization accuracy, we split a text line into a sequence of fine-scale text proposals, and predict each of them separately.
- Furthermore, we aim to encode this information directly in the convolutional layer, resulting in an elegant and seamless in-network connection of the fine-scale text proposals.
- The fine-scale text proposals are detected accurately and reliably by our CTPN.
- The fine-scale detection and RNN connection are able to predict accurate localizations in the vertical direction.
- The side-proposals are defined as the start and end proposals when we connect a sequence of detected fine-scale text proposals into a text line.
- The color of a fine-scale proposal box indicates its text/non-text score.
- In our experiments, we first verify the efficiency of each proposed component individually, e.g., the fine-scale text proposal detection or the in-network recurrent connection.
- 4.2 Fine-Scale Text Proposal Network with Faster R-CNN
- We first discuss our fine-scale detection strategy against the RPN and Faster R-CNN system [25].
- Obviously, the proposed fine-scale text proposal network (FTPN) improves the Faster R-CNN remarkably in both precision and recall, suggesting that the FTPN is more accurate and reliable, by predicting a sequence of fine-scale text proposals rather than a whole text line.
- The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional maps.
|
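A small illustrative sketch of the fixed-width vertical anchor design referenced in the entry above. The stride of 16 matches the sentences in this document; the ten heights (roughly 11 to 273 pixels) follow the paper's setup but should be treated as assumptions here:

```python
# Hypothetical anchor generator: fixed 16-pixel width, k vertical heights
# per conv5 location (stride 16 in the input image).
import numpy as np

def vertical_anchors(feat_h, feat_w, stride=16, width=16,
                     heights=(11, 16, 23, 33, 48, 68, 97, 139, 198, 273)):
    """Return (feat_h * feat_w * k, 4) anchors as (x1, y1, x2, y2)."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx = j * stride + stride / 2.0   # horizontal center is fixed
            cy = i * stride + stride / 2.0
            for h in heights:                # k vertical scales per location
                anchors.append([cx - width / 2.0, cy - h / 2.0,
                                cx + width / 2.0, cy + h / 2.0])
    return np.array(anchors, dtype=np.float32)
```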
3 | ICDAR (22) | [!≈ aɪ si: di: eɪ ɑ:(r)] |
- It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpassing recent results [8,35] by a large margin.
- Fourth, our method achieves new state-of-the-art results on a number of benchmarks, significantly improving recent results (e.g., 0.88 F-measure over 0.83 in [8] on the ICDAR 2013, and 0.61 F-measure over 0.54 in [35] on the ICDAR 2015).
- Furthermore, it is computationally efficient, resulting in a 0.14s/image running time (on the ICDAR 2013) by using the very deep VGG16 model [27].
- Our model was trained on 3,000 natural images, including 229 images from the ICDAR 2013 training set.
- We evaluate the CTPN on five text detection benchmarks, namely the ICDAR 2011 [21], ICDAR 2013 [19], ICDAR 2015 [18], SWT [3], and Multilingual dataset [24].
- The ICDAR 2013 is used for this component evaluation.
- The ICDAR 2011 dataset [21] consists of 229 training images and 255 testing ones, where the images are labelled at word level.
- The ICDAR 2013 [19] is similar to the ICDAR 2011, and has 462 images in total, including 229 images for training and 233 for testing.
- The ICDAR 2015 (Incidental Scene Text - Challenge 4) [18] includes 1,500 images which were collected by using Google Glass.
- For the ICDAR 2011 we use the standard protocol proposed by [30]; the evaluation on the ICDAR 2013 follows the standard in [19].
- For the ICDAR 2015, we used the online evaluation system provided by the organizers, as in [18].
- The RPN proposals may roughly localize a major part of a text line or word, but they are not accurate enough by the ICDAR 2013 standard.
- Table 1: Component evaluation on the ICDAR 2013, and state-of-the-art results on the SWT and Multilingual datasets.
- We set the short side of images to 2000 for the SWT and ICDAR 2015, and 600 for the other three.
- On the ICDAR 2013, it outperforms the recent TextFlow [28] and FASText [1] remarkably, improving the F-measure from 0.80 to 0.88.
- Table 2: State-of-the-art results on the ICDAR 2011, 2013 and 2015.
- By using a scale of 450, the running time is reduced to 0.09s/image, while obtaining P/R/F of 0.92/0.77/0.84 on the ICDAR 2013, which compares competitively against Gupta et al.'s approach [8] at 0.07s/image with a GPU.
|
4 | e.g. (20) | [ˌi: ˈdʒi:] |
- These methods commonly explore low-level features (e.g., based on SWT [3,13], MSER [14,33,23], or HoG [28]) to distinguish text candidates from the background.
- For object detection, a typical correct detection is defined loosely, e.g., by an overlap of > 0.5 between the detected bounding box and its ground truth (e.g., the PASCAL standard [4]), since people can recognize an object easily from a major part of it.
- Therefore, text detection generally requires a more accurate localization, leading to a different evaluation standard, e.g., Wolf's standard [30], which is commonly employed by text benchmarks [19,21].
- Fourth, our method achieves new state-of-the-art results on a number of benchmarks, significantly improving recent results (e.g., 0.88 F-measure over 0.83 in [8] on the ICDAR 2013, and 0.61 F-measure over 0.54 in [35] on the ICDAR 2015).
- The CC-based approaches discriminate text and non-text pixels by using a fast filter, and then text pixels are greedily grouped into stroke or character candidates, by using low-level properties, e.g., intensity, color, gradient, etc. [33,14,32,13,3].
- However, the RPN proposals are not discriminative, and require further refinement and classification by an additional costly CNN model, e.g., the Fast R-CNN model [5].
- It detects a text line by densely sliding a small window in the convolutional feature maps, and outputs a sequence of fine-scale (e.g., fixed 16-pixel width) text proposals, as shown in Fig. 1 (b).
- We use a small spatial window, 3×3, to slide the feature maps of the last convolutional layer (e.g., the conv5 of the VGG16).
- Text detection is defined at word or text line level, so that it may be easy to make an incorrect detection by defining it as a single object, e.g., detecting part of a word.
- It is natural to consider a text line as a sequence of fine-scale text proposals, where each proposal generally represents a small part of a text line, e.g., a text piece with a 16-pixel width.
- We further extend the RNN layer by using a bi-directional LSTM, which allows it to encode the recurrent context in both directions, so that the connectionist receptive field is able to cover the whole image width, e.g., $228 \times width$.
- This may lead to an inaccurate localization when the text proposals on both horizontal sides are not exactly covered by a ground truth text line area, or some side proposals are discarded (e.g., having a low text score), as shown in Fig. 4.
- where $x_{side}$ is the predicted x-coordinate of the nearest horizontal side (e.g., left or right side) to the current anchor.
- k is the index of a side-anchor, which is defined as a set of anchors within a horizontal distance (e.g., 32 pixels) to the left or right side of a ground truth text line bounding box.
- We initialize the new layers (e.g., the RNN and output layers) by using random weights with a Gaussian distribution of 0 mean and 0.01 standard deviation.
- In our experiments, we first verify the efficiency of each proposed component individually, e.g., the fine-scale text proposal detection or the in-network recurrent connection.
- It is of great importance for recovering highly ambiguous text (e.g., extremely small-scale ones), which is one of the main advantages of our CTPN, as demonstrated in Fig. 6.
- It is able to handle multi-scale and multi-language text efficiently (e.g., Chinese and Korean).
- This may be due to the strong capability of the CTPN for detecting extremely challenging text, e.g., very small-scale text, some of which is even difficult for humans.
|
5 | RPN (20) | [!≈ ɑ:(r) pi: en] |
- The state-of-the-art method is the Faster Region-CNN (R-CNN) system [25], where a Region Proposal Network (RPN) is proposed to generate high-quality class-agnostic object proposals directly from convolutional feature maps.
- Then the RPN proposals are fed into a Fast R-CNN [5] model for further classification and refinement, leading to the state-of-the-art performance on generic object detection.
- In this work, we fill this gap by extending the RPN architecture [25] to accurate text line localization.
- This departs from the RPN prediction of a whole object, which has difficulty providing satisfactory localization accuracy.
- They proposed a Region Proposal Network (RPN) that generates high-quality class-agnostic object proposals directly from the convolutional feature maps.
- The RPN is fast by sharing convolutional computation.
- However, the RPN proposals are not discriminative, and require further refinement and classification by an additional costly CNN model, e.g., the Fast R-CNN model [5].
- Similar to the Region Proposal Network (RPN) [25], the CTPN is essentially a fully convolutional network that allows an input image of arbitrary size.
- In [25], Ren et al. proposed an efficient anchor regression mechanism that allows the RPN to detect multi-scale objects with a single-scale window.
- An example is shown in Fig. 2, where the RPN is directly trained for localizing text lines in an image.
- Fig. 2: Left: RPN proposals.
- We observed that word detection by the RPN has difficulty accurately predicting the horizontal sides of words, since each character within a word is isolated or separated, making it difficult to find the start and end locations of a word.
- This reduces the search space, compared to the RPN which predicts 4 coordinates of an object.
- Compared to the RPN or Faster R-CNN system [25], our fine-scale detection provides more detailed supervised information that naturally leads to more accurate detection.
- Similar to the RPN [25], training samples are the anchors, whose locations can be pre-computed in the input image, so that the training labels of each anchor can be computed from the corresponding GT box.
- We first discuss our fine-scale detection strategy against the RPN and Faster R-CNN system [25].
- As can be found in Table 1 (left), the RPN alone has difficulty performing accurate text localization, generating a large number of false detections (low precision).
- By refining the RPN proposals with a Fast R-CNN detection model [5], the Faster R-CNN system improves localization accuracy considerably, with an F-measure of 0.75.
- One observation is that the Faster R-CNN also increases the recall of the original RPN.
- The RPN proposals may roughly localize a major part of a text line or word, but they are not accurate enough by the ICDAR 2013 standard.
|
6 | recurrent (16) | [rɪˈkʌrənt] |
- The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model.
- Scene text detection, convolutional network, recurrent neural network, anchor mechanism
- Then, an in-network recurrent architecture is proposed to connect these fine-scale text proposals in sequences, allowing them to encode rich context information.
- We strive for a further step by proposing an in-network recurrent mechanism that allows our model to detect a text sequence directly in the convolutional maps, avoiding further post-processing by an additional costly CNN detection model.
- It includes three key contributions that make it reliable and accurate for text localization: detecting text in fine-scale proposals, recurrent connectionist text proposals, and side-refinement.
- 3.2 Recurrent Connectionist Text Proposals
- This has been verified by recent work [9], where a recurrent neural network (RNN) is applied to encode this context information for text recognition.
- W is the width of the conv5. $H_t$ is a recurrent internal state that is computed jointly from both the current input ($X_t$) and the previous states encoded in $H_{t-1}$.
- The recurrence is computed by using a non-linear function $\varphi$, which defines the exact form of the recurrent model (written out after this entry).
- Hence the internal state in the RNN hidden layer accesses the sequential context information scanned by all previous windows through the recurrent connection.
- We further extend the RNN layer by using a bi-directional LSTM, which allows it to encode the recurrent context in both directions, so that the connectionist receptive field is able to cover the whole image width, e.g., $228 \times width$.
- In our experiments, we first verify the efficiency of each proposed component individually, e.g., the fine-scale text proposal detection or the in-network recurrent connection.
- 4.3 Recurrent Connectionist Text Proposals
- We discuss the impact of the recurrent connection on our CTPN.
- As shown in Table 1 (left), with our recurrent connection, the CTPN improves the FTPN substantially, from an F-measure of 0.80 to 0.88.
- Therefore, the proposed in-network recurrent mechanism increases model computation marginally, with considerable performance gain obtained.
|
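For reference, the recurrence described in the entry above (an internal state computed from the current input and the previous state, over the W sliding-window steps of each row) can be written as:

$$H_t = \varphi(H_{t-1}, X_t), \qquad t = 1, 2, \ldots, W$$

where $X_t$ is the conv5 feature of the $t$-th sliding window in a row and $W$ is the width of the conv5 map.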
7 | connectionist (11) | [kə'nekʃənɪst] |
- Detecting Text in Natural Image with Connectionist Text Proposal Network
- We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural images.
- We propose a novel Connectionist Text Proposal Network (CTPN) that directly localizes text sequences in convolutional layers.
- Fig. 1: (a) Architecture of the Connectionist Text Proposal Network (CTPN).
- 3. Connectionist Text Proposal Network
- This section presents details of the Connectionist Text Proposal Network (CTPN).
- It includes three key contributions that make it reliable and accurate for text localization: detecting text in fine-scale proposals, recurrent connectionist text proposals, and side-refinement.
- 3.2 Recurrent Connectionist Text Proposals
- We further extend the RNN layer by using a bi-directional LSTM, which allows it to encode the recurrent context in both directions, so that the connectionist receptive field is able to cover the whole image width, e.g., $228 \times width$.
- 4.3 Recurrent Connectionist Text Proposals
- We have presented the Connectionist Text Proposal Network (CTPN), an efficient text detector that is end-to-end trainable.
|
8 | sequential (11) | [sɪˈkwenʃl] |
- The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model.
- The sequential windows in each row are recurrently connected by a Bi-directional LSTM (BLSTM) [7], where the convolutional feature (3×3×C) of each window is used as input to the 256D BLSTM (including two 128D LSTMs); a sequence-building sketch follows this entry.
- (b) The CTPN outputs sequential fixed-width fine-scale text proposals.
- Second, we propose an in-network recurrence mechanism that elegantly connects sequential text proposals in the convolutional feature maps.
- Text has strong sequential characteristics, where the sequential context information is crucial for making a reliable decision.
- Their results have shown that the sequential context information greatly facilitates the recognition task on cropped word images.
- To this end, we propose to design an RNN layer upon the conv5, which takes the convolutional feature of each window as sequential inputs, and updates its internal state recurrently in the hidden layer, $H_t$.
- The sliding window moves densely from left to right, resulting in $t=1,2,\ldots,W$ sequential features for each row.
- Hence the internal state in the RNN hidden layer accesses the sequential context information scanned by all previous windows through the recurrent connection.
- We propose an in-network RNN layer that connects sequential text proposals elegantly, allowing it to explore meaningful context information.
|
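A minimal sketch, assuming PyTorch, of how the per-row BLSTM input sequences referenced in the entry above could be built: each 3×3×C window around a conv5 position becomes one 9C-dimensional input step. The tensor shapes and helper name are assumptions for illustration, not the authors' code:

```python
# Hypothetical im2col-style construction: one 3x3xC patch per spatial
# location (zero-padded at borders), one length-W sequence per row.
import torch
import torch.nn as nn

def row_sequences(conv5):                       # conv5: (N, C, H, W)
    n, c, h, w = conv5.shape
    patches = nn.functional.unfold(conv5, kernel_size=3, padding=1)  # (N, 9C, H*W)
    patches = patches.transpose(1, 2).reshape(n * h, w, 9 * c)
    return patches                              # one length-W sequence per row

# 256D BLSTM built from two 128D LSTMs, as described above.
blstm = nn.LSTM(input_size=9 * 512, hidden_size=128,
                bidirectional=True, batch_first=True)
seq = row_sequences(torch.randn(1, 512, 38, 50))
out, _ = blstm(seq)                             # (38, 50, 256) for this input
```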
9 | bounding (11) | [baundɪŋ] |
- For object detection, a typical correct detection is defined loosely, e.g., by an overlap of > 0.5 between the detected bounding box and its ground truth (e.g., the PASCAL standard [4]), since people can recognize an object easily from a major part of it.
- The explicit vertical coordinates are measured by the height and y-axis center of a proposal bounding box.
- We compute the relative predicted vertical coordinates ($\textbf{v}$) with respect to the bounding box location of an anchor as follows:
- Therefore, each predicted text proposal has a bounding box with a size of $h\times 16$ (in the input image), as shown in Fig. 1 (b) and Fig. 2 (right).
- $x^*_{side}$ is the ground truth (GT) side coordinate on the x-axis, which is pre-computed from the GT bounding box and the anchor location.
- We only use the offsets of the side-proposals to refine the final text line bounding box.
- k is the index of a side-anchor, which is defined as a set of anchors within a horizontal distance (e.g., 32 pixels) to the left or right side of a ground truth text line bounding box.
- It is defined by computing the IoU overlap with the GT bounding box (divided by anchor location).
- We collected the other images ourselves and manually labelled them with text line bounding boxes.
- This may benefit from the joint bounding box regression mechanism of the Fast R-CNN, which improves the accuracy of a predicted bounding box.
|
10 | side-refinement (10) | [!≈ saɪd rɪˈfaɪnmənt] |
- The RNN layer is connected to a 512D fully-connected layer, followed by the output layer, which jointly predicts text/non-text scores, y-axis coordinates and side-refinement offsets of k anchors.
- It includes three key contributions that make it reliable and accurate for text localization: detecting text in fine-scale proposals, recurrent connectionist text proposals, and side-refinement.
- 3.3 Side-refinement
- To address this problem, we propose a side-refinement approach that accurately estimates the offset for each anchor/proposal on both the left and right horizontal sides (referred to as side-anchor or side-proposal); the offset formula is given after this entry.
- Several detection examples improved by side-refinement are presented in Fig. 4.
- The side-refinement further improves the localization accuracy, leading to about $2\%$ performance improvements on the SWT and Multi-Lingual datasets.
- Notice that the offset for side-refinement is predicted simultaneously by our model, as shown in Fig. 1.
- Fig. 4: CTPN detection with (red box) and without (yellow dashed box) the side-refinement.
- The three outputs simultaneously predict the text/non-text scores ($\textbf{s}$), the vertical coordinates ($\textbf{v}$) in Eq. (2), and the side-refinement offset ($\textbf{o}$). We explore k anchors to predict them at each spatial location in the conv5, resulting in 2k, 2k and k parameters in the output layer, respectively.
- We introduce three loss functions, $L^{cl}_s$, $L^{re}_v$ and $L^{re}_o$, which compute the errors of the text/non-text score, the coordinates and the side-refinement, respectively.
|
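For reference, the side-refinement offset referenced in the entry above is computed relative to the anchor; this is reconstructed from the definitions quoted in this document ($x_{side}$, $x^*_{side}$, $c_x^a$), with $w^a$ the fixed anchor width, $w^a = 16$:

$$o = \frac{x_{side} - c_x^a}{w^a}, \qquad o^* = \frac{x^*_{side} - c_x^a}{w^a}$$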
11 | F-measure (9) | [!≈ ef ˈmeʒə(r)] |
- It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpassing recent results [8,35] by a large margin.
- Fourth, our method achieves new state-of-the-art results on a number of benchmarks, significantly improving recent results (e.g., 0.88 F-measure over 0.83 in [8] on the ICDAR 2013, and 0.61 F-measure over 0.54 in [35] on the ICDAR 2015).
- By refining the RPN proposals with a Fast R-CNN detection model [5], the Faster R-CNN system improves localization accuracy considerably, with an F-measure of 0.75.
- As shown in Table 1 (left), with our recurrent connection, the CTPN improves the FTPN substantially, from an F-measure of 0.80 to 0.88.
- On the SWT, our improvements are significant on both recall and F-measure, with a marginal gain in precision.
- On the ICDAR 2013, it outperforms the recent TextFlow [28] and FASText [1] remarkably, improving the F-measure from 0.80 to 0.88.
- It consistently obtains substantial improvements in F-measure and recall.
- Regardless of running time, our method outperforms FASText substantially, with an $11\%$ improvement in F-measure (the F-measure formula is given after this entry).
|
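For reference, the F-measure cited throughout is the harmonic mean of precision $P$ and recall $R$:

$$F = \frac{2PR}{P + R}$$

For example, the P/R of 0.92/0.77 quoted above gives $F = 2 \times 0.92 \times 0.77 / (0.92 + 0.77) \approx 0.84$.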
12 | SWT (8) | [!≈ es ˈdʌblju: ti:] |
- These methods commonly explore low-level features (e.g., based on SWT [3,13], MSER [14,33,23], or HoG [28]) to distinguish text candidates from the background.
- The side-refinement further improves the localization accuracy, leading to about $2\%$ performance improvements on the SWT and Multi-Lingual datasets.
- We evaluate the CTPN on five text detection benchmarks, namely the ICDAR 2011 [21], ICDAR 2013 [19], ICDAR 2015 [18], SWT [3], and Multilingual dataset [24].
- Epshtein et al. [3] introduced the SWT dataset containing 307 images, which includes many extremely small-scale texts.
- The evaluations on the SWT and Multilingual datasets follow the protocols defined in [3] and [24], respectively.
- Table 1: Component evaluation on the ICDAR 2013, and state-of-the-art results on the SWT and Multilingual datasets.
- We set the short side of images to 2000 for the SWT and ICDAR 2015, and 600 for the other three.
- On the SWT, our improvements are significant on both recall and F-measure, with a marginal gain in precision.
|
13 | GT (8) | [dʒi:'ti:] |
- $x^*_{side}$ is the ground truth (GT) side coordinate on the x-axis, which is pre-computed from the GT bounding box and the anchor location.
- Similar to the RPN [25], training samples are the anchors, whose locations can be pre-computed in the input image, so that the training labels of each anchor can be computed from the corresponding GT box.
- It is defined by computing the IoU overlap with the GT bounding box (divided by anchor location).
- A positive anchor is defined as: (i) an anchor that has an IoU overlap of > 0.7 with any GT box; or (ii) the anchor with the highest IoU overlap with a GT box (a labelling sketch follows this entry).
- The negative anchors are defined as having an IoU overlap of < 0.5 with all GT boxes.
- As shown in Fig. 6, those challenging ones are detected correctly by our detector, but some of them are even missed by the GT labelling, which may reduce our precision in the evaluation.
|
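A minimal NumPy sketch of the anchor-labelling rules quoted in the entry above (positive: IoU > 0.7 with any GT box, or highest IoU for some GT box; negative: IoU < 0.5 with all GT boxes). The (x1, y1, x2, y2) box format and the "ignore" label for in-between anchors are assumptions:

```python
import numpy as np

def iou(anchors, gts):
    """Pairwise IoU between (A,4) anchors and (G,4) GT boxes -> (A,G)."""
    x1 = np.maximum(anchors[:, None, 0], gts[None, :, 0])
    y1 = np.maximum(anchors[:, None, 1], gts[None, :, 1])
    x2 = np.minimum(anchors[:, None, 2], gts[None, :, 2])
    y2 = np.minimum(anchors[:, None, 3], gts[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gts):
    """1 = positive, 0 = negative, -1 = ignored during training (assumed)."""
    overlaps = iou(anchors, gts)
    labels = np.full(len(anchors), -1, dtype=np.int8)
    labels[overlaps.max(axis=1) < 0.5] = 0     # negative: < 0.5 with all GT boxes
    labels[overlaps.max(axis=1) > 0.7] = 1     # rule (i): > 0.7 with any GT box
    labels[overlaps.argmax(axis=0)] = 1        # rule (ii): best anchor per GT box
    return labels
```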
14 | jointly (7) | [dʒɔɪntlɪ] |
- We develop a vertical anchor mechanism that jointly predicts the location and text/non-text score of each fixed-width proposal, considerably improving localization accuracy.
- The RNN layer is connected to a 512D fully-connected layer, followed by the output layer, which jointly predicts text/non-text scores, y-axis coordinates and side-refinement offsets of k anchors.
- We develop an anchor regression mechanism that jointly predicts the vertical location and text/non-text score of each text proposal, resulting in excellent localization accuracy.
- W is the width of the conv5. $H_t$ is a recurrent internal state that is computed jointly from both the current input ($X_t$) and the previous states encoded in $H_{t-1}$.
- The proposed CTPN has three outputs which are jointly connected to the last FC layer, as shown in Fig. 1 (a).
- We employ multi-task learning to jointly optimize the model parameters (a sketch of the joint objective follows this entry).
- We develop a vertical anchor mechanism that jointly predicts the precise location and text/non-text score for each proposal, which is the key to realizing accurate localization of text.
|
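A reconstruction of the multi-task objective implied by the entry above, combining the three losses named in this document ($L^{cl}_s$, $L^{re}_v$, $L^{re}_o$); the weights $\lambda_1, \lambda_2$ and normalization terms $N_s, N_v, N_o$ follow the paper's formulation, but treat the exact form here as a sketch:

$$L(\textbf{s}_i, \textbf{v}_j, \textbf{o}_k) = \frac{1}{N_s}\sum_i L^{cl}_s(\textbf{s}_i, \textbf{s}^*_i) + \frac{\lambda_1}{N_v}\sum_j L^{re}_v(\textbf{v}_j, \textbf{v}^*_j) + \frac{\lambda_2}{N_o}\sum_k L^{re}_o(\textbf{o}_k, \textbf{o}^*_k)$$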
15 | VGG16 (7) | |
- The CTPN is computationally efficient at 0.14s/image, by using the very deep VGG16 model [27].
- We densely slide a 3×3 spatial window through the last convolutional maps (conv5) of the VGG16 model [27].
- Furthermore, it is computationally efficient, resulting in a 0.14s/image running time (on the ICDAR 2013) by using the very deep VGG16 model [27].
- We take the very deep 16-layer vggNet (VGG16) [27] as an example to describe our approach, which is readily applicable to other deep models.
- We use a small spatial window, 3×3, to slide the feature maps of the last convolutional layer (e.g., the conv5 of the VGG16).
- Given an input image, we have $W \times H \times C$ conv5 feature maps (by using the VGG16 model), where C is the number of feature maps or channels, and $W \times H$ is the spatial arrangement.
- We follow the standard practice and explore the very deep VGG16 model [27] pre-trained on the ImageNet data [26].
|
16 | in-network (7) | [!≈ ɪn ˈnetwɜ:k] |
- Then, an in-network recurrent architecture is proposed to connect these fine-scale text proposals in sequences, allowing them to encode rich context information.
- We strive for a further step by proposing an in-network recurrent mechanism that allows our model to detect a text sequence directly in the convolutional maps, avoiding further post-processing by an additional costly CNN detection model.
- Second, we propose an in-network recurrence mechanism that elegantly connects sequential text proposals in the convolutional feature maps.
- Furthermore, we aim to encode this information directly in the convolutional layer, resulting in an elegant and seamless in-network connection of the fine-scale text proposals.
- In our experiments, we first verify the efficiency of each proposed component individually, e.g., the fine-scale text proposal detection or the in-network recurrent connection.
- Therefore, the proposed in-network recurrent mechanism increases model computation marginally, with considerable performance gain obtained.
- We propose an in-network RNN layer that connects sequential text proposals elegantly, allowing it to explore meaningful context information.
|
17 | generic (7) | [dʒəˈnerɪk] |
- Then the RPN proposals are fed into a Fast R-CNN [5] model for further classification and refinement, leading to the state-of-the-art performance on generic object detection.
- In generic object detection, each object has a well-defined closed boundary [2], while such a well-defined boundary may not exist in text, since a text line or word is composed of a number of separate characters or strokes.
- We present several technical developments that tailor a generic object detection model elegantly towards our problem.
- However, text differs substantially from generic objects, which generally have a well-defined enclosed boundary and center, allowing the whole object to be inferred from even a part of it [2].
- Obviously, a text line is a sequence, which is the main difference between text and generic objects.
- This inaccuracy may not be crucial in generic object detection, but it should not be ignored in text detection, particularly for those small-scale text lines or words.
- This differs from generic object detection, where the impact of condition (ii) may not be significant.
|
18 | small-scale (7) | [ˈsmɔ:lˈskeɪl] |
- This inaccuracy may not be crucial in generic object detection, but it should not be ignored in text detection, particularly for those small-scale text lines or words.
- This is crucial for detecting small-scale text patterns, which is one of the key advantages of the CTPN.
- This dataset is more challenging than previous ones by including arbitrary orientation, very small-scale and low-resolution text.
- Epshtein et al. [3] introduced the SWT dataset containing 307 images, which includes many extremely small-scale texts.
- It is of great importance for recovering highly ambiguous text (e.g., extremely small-scale ones), which is one of the main advantages of our CTPN, as demonstrated in Fig. 6.
- Fig. 6: CTPN detection results on extremely small-scale cases (in red boxes), where some ground truth boxes are missed.
- This may be due to the strong capability of the CTPN for detecting extremely challenging text, e.g., very small-scale text, some of which is even difficult for humans.
|
19 | y-coordinate (6) | [ˌwaikəuˈɔ:dinət,-neit] |
- Then we design k vertical anchors to predict y-coordinates for each proposal.
- Our detector outputs the text/non-text scores and the predicted y-coordinates ($\textbf{v}$) for k anchors at each window location (the parameterization is given after this entry).
- Similar to the y-coordinate prediction, we compute the relative offset as:
- $\textbf{s}_i^*\in\lbrace 0,1\rbrace$ is the ground truth. $j$ is the index of an anchor in the set of valid anchors for y-coordinate regression, which is defined as follows.
- $\textbf{v}_j$ and $\textbf{v}_j^*$ are the predicted and ground truth y-coordinates associated with the $j$-th anchor.
- The training labels for the y-coordinate regression ($\textbf{v}^*$) and offset regression ($\textbf{o}^*$) are computed as in Eq. (2) and (4), respectively.
|
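For reference, the relative vertical coordinates $\textbf{v} = \lbrace v_c, v_h \rbrace$ of Eq. (2), reconstructed from the definitions quoted above ($c_y^a$, $h^a$: anchor center and height; $c_y$, $h$: predictions; $c^*_y$, $h^*$: ground truth):

$$v_c = (c_y - c_y^a)/h^a, \qquad v_h = \log(h/h^a)$$
$$v^*_c = (c^*_y - c_y^a)/h^a, \qquad v^*_h = \log(h^*/h^a)$$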
20 | substantially (5) | [səbˈstænʃəli] |
- Deep Convolutional Neural Networks (CNN) have recently advanced general object detection substantially [25,5,6].
- Convolutional Neural Networks (CNN) have recently advanced general object detection substantially [25,5,6].
- However, text differs substantially from generic objects, which generally have a well-defined enclosed boundary and center, allowing the whole object to be inferred from even a part of it [2].
- As shown in Table 1 (left), with our recurrent connection, the CTPN improves the FTPN substantially, from an F-measure of 0.80 to 0.88.
- Regardless of running time, our method outperforms FASText substantially, with an $11\%$ improvement in F-measure.
|
21 | y-axis (5) | [ˈwaiˌæksis] |
- The RNN layer is connected to a 512D fully-connected layer, followed by the output layer, which jointly predicts text/non-text scores, y-axis coordinates and side-refinement offsets of k anchors.
- We develop a vertical anchor mechanism that simultaneously predicts a text/non-text score and the y-axis location of each fine-scale proposal.
- The explicit vertical coordinates are measured by the height and y-axis center of a proposal bounding box.
- $c_y^a$ and $h^a$ are the center (y-axis) and height of the anchor box, which can be pre-computed from an input image.
- $c_y$ and $h$ are the predicted y-axis coordinates in the input image, while $c^*_y$ and $h^*$ are the ground truth coordinates.
|
22 | Multilingual (5) | [ˌmʌltiˈlɪŋgwəl] |
- We evaluate the CTPN on five text detection benchmarks, namely the ICDAR 2011 [21], ICDAR 2013 [19], ICDAR 2015 [18], SWT [3], and Multilingual dataset [24].
- The Multilingual scene text dataset was collected by [24].
- The evaluations on the SWT and Multilingual datasets follow the protocols defined in [3] and [24], respectively.
- Table 1: Component evaluation on the ICDAR 2013, and state-of-the-art results on the SWT and Multilingual datasets.
- Our detector performs favourably against TextFlow on the Multilingual dataset, suggesting that our method generalizes well to various languages.
|
23 | trainable (4) | [t'reɪnəbl] |
- The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model.
- Third, both methods are integrated seamlessly to meet the nature of the text sequence, resulting in a unified end-to-end trainable model.
- Therefore, our integration with the RNN layer is elegant, resulting in an efficient model that is end-to-end trainable without additional cost.
- We have presented the Connectionist Text Proposal Network (CTPN), an efficient text detector that is end-to-end trainable.
|
24 | computationally (3) | [!≈ ˌkɒmpjuˈteɪʃənli] |
- The CTPN is computationally efficient at 0.14s/image, by using the very deep VGG16 model [27].
- Furthermore, it is computationally efficient, resulting in a 0.14s/image running time (on the ICDAR 2013) by using the very deep VGG16 model [27].
- Another limitation is that the sliding-window methods are computationally expensive, as they run a classifier on a huge number of sliding windows.
|
25 | class-agnostic (3) | [!≈ klɑ:s ægˈnɒstɪk] |
- The state-of-the-art method is the Faster Region-CNN (R-CNN) system [25], where a Region Proposal Network (RPN) is proposed to generate high-quality class-agnostic object proposals directly from convolutional feature maps.
- Selective Search (SS) [4], which generates class-agnostic object proposals, is one of the most popular methods applied in recent leading object detection systems, such as Region CNN (R-CNN) [6] and its extensions [5].
- They proposed a Region Proposal Network (RPN) that generates high-quality class-agnostic object proposals directly from the convolutional feature maps.
|
26 | refinement (3) | [rɪˈfaɪnmənt] |
- Then the RPN proposals are fed into a Fast R-CNN [5] model for further classification and refinement, leading to the state-of-the-art performance on generic object detection.
- Our method is able to handle multi-scale and multi-lingual text in a single process, avoiding further post-filtering or refinement.
- However, the RPN proposals are not discriminative, and require further refinement and classification by an additional costly CNN model, e.g., the Fast R-CNN model [5].
|
27 | elegantly (3) | ['elɪɡəntlɪ] |
- We present several technical developments that tailor a generic object detection model elegantly towards our problem.
- Second, we propose an in-network recurrence mechanism that elegantly connects sequential text proposals in the convolutional feature maps.
- We propose an in-network RNN layer that connects sequential text proposals elegantly, allowing it to explore meaningful context information.
|
28 | recurrently (3) | [rɪ'kʌrəntlɪ] |
- The sequential windows in each row are recurrently connected by a Bi-directional LSTM (BLSTM) [7], where the convolutional feature (3×3×C) of each window is used as input to the 256D BLSTM (including two 128D LSTMs).
- RNN provides a natural choice for encoding this information recurrently using its hidden layers.
- To this end, we propose to design an RNN layer upon the conv5, which takes the convolutional feature of each window as sequential inputs, and updates its internal state recurrently in the hidden layer, $H_t$.
|
29 | receptive (3) | [rɪˈseptɪv] |
- The size of the conv5 feature maps is determined by the size of the input image, while the total stride and receptive field are fixed at 16 and 228 pixels, respectively.
- Both the total stride and receptive field are fixed by the network architecture.
- Generally, a text proposal is much smaller than its effective receptive field, which is $228\times228$.
|
30 | side-proposal (3) | [!≈ saɪd prəˈpəʊzl] |
- To address this problem, we propose a side-refinement approach that accurately estimates the offset for each anchor/proposal on both the left and right horizontal sides (referred to as side-anchor or side-proposal).
- The side-proposals are defined as the start and end proposals when we connect a sequence of detected fine-scale text proposals into a text line.
- We only use the offsets of the side-proposals to refine the final text line bounding box.
|
31 | x-axis (3) | [ˈeksˌæksis] |
- $x^*_{side}$ is the ground truth (GT) side coordinate on the x-axis, which is pre-computed from the GT bounding box and the anchor location.
- $c_x^a$ is the center of the anchor on the x-axis.
- $\textbf{o}_k$ and $\textbf{o}_k^*$ are the predicted and ground truth offsets on the x-axis associated with the $k$-th anchor.
|
32 | FTPN (3) | [!≈ ef ti: pi: en] |
- Obviously, the proposed fine-scale text proposal network (FTPN) improves the Faster R-CNN remarkably in both precision and recall, suggesting that the FTPN is more accurate and reliable, by predicting a sequence of fine-scale text proposals rather than a whole text line.
- As shown in Table 1 (left), with our recurrent connection, the CTPN improves the FTPN substantially, from an F-measure of 0.80 to 0.88.
|
33 | FASText (3) | [fɑːs'tekst] |
- On the ICDAR 2013, it outperforms the recent TextFlow [28] and FASText [1] remarkably, improving the F-measure from 0.80 to 0.88.
- FASText [1] achieves 0.15s/image CPU time.
- Regardless of running time, our method outperforms FASText substantially, with an $11\%$ improvement in F-measure.
|
34 | seamlessly (2) | ['si:mlisli] |
- The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model.序列提议通过循环神经网络自然地连接起来,该网络无缝地结合到卷积网络中,从而形成端到端的可训练模型。
- Third, both methods are integrated seamlessly to meet the nature of text sequence, resulting in a unified end-to-end trainable model.第三,两种方法无缝集成,以符合文本序列的性质,从而形成统一的端到端可训练模型。
|
35 | connected-component (2) | [!≈ kə'nektɪd kəmˈpəʊnənt] |
- Their performance relies heavily on the results of character detection, and connected-component methods or sliding-window methods have been proposed.它们的性能很大程度上依赖于字符检测的结果,并且已经提出了连接组件方法或滑动窗口方法。
- They can be roughly grouped into two categories, connected-components (CCs) based approaches and sliding-window based methods.它们可以粗略地分为两类,基于连接组件(CC)的方法和基于滑动窗口的方法。
|
36 | sequentially (2) | [sɪ'kwenʃəlɪ] |
- Furthermore, these false detections are easily accumulated sequentially in bottom-up pipeline, as pointed out in [28].此外,正如[28]所指出的,这些误检很容易在自下而上的过程中连续累积。
- Then a text line is constructed by sequentially connecting the pairs having the same proposal (a sketch of this grouping follows this entry).然后通过顺序连接具有相同提议的对来构建文本行(该分组步骤的示意代码见本条目之后)。
|
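A minimal sketch of that grouping step, assuming the (i, j) proposal-index pairs have already been produced by the paper's neighbour test (which is not quoted in this entry); union-find is used here so that pairs sharing a proposal merge correctly regardless of their order:

```python
def connect_pairs(pairs):
    """Merge proposal-index pairs that share a proposal into text lines.

    pairs: iterable of (i, j) proposal-index pairs.
    Returns one sorted index group per text line (union-find grouping).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)           # union the two proposals

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return [sorted(g) for g in groups.values()]

print(connect_pairs([(0, 1), (1, 2), (4, 5)]))   # -> [[0, 1, 2], [4, 5]]
```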
37 | bi-directional (2) | ['bɪdɪr'ekʃənl] |
- The sequential windows in each row are recurrently connected by a Bi-directional LSTM (BLSTM) [7], where the convolutional feature (3×3×C) of each window is used as input of the 256D BLSTM (including two 128D LSTMs).每行的序列窗口通过双向LSTM(BLSTM)[7]循环连接,其中每个窗口的卷积特征(3×3×C)被用作256维的BLSTM(包括两个128维的LSTM)的输入。
- We further extend the RNN layer by using a bi-directional LSTM, which allows it to encode the recurrent context in both directions, so that the connectionist receptive field is able to cover the whole image width, e.g., $228 \times width$.我们通过使用双向LSTM来进一步扩展RNN层,这使得它能够在两个方向上对递归上下文进行编码,以便连接感受野能够覆盖整个图像宽度,例如$228\times width$。
|
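A minimal sketch of the 256D bi-directional LSTM described above, written in PyTorch purely for illustration (the paper's implementation is in Caffe); shapes follow the quoted text: each 3×3×C window feature is one time step and each conv5 row is one sequence:

```python
import torch
import torch.nn as nn

C, H, W = 512, 20, 30                      # conv5 channels and a toy spatial size

# Two 128D LSTMs (one per direction) give a 256D output per window.
blstm = nn.LSTM(input_size=3 * 3 * C, hidden_size=128,
                bidirectional=True, batch_first=True)

conv5 = torch.randn(1, C, H, W)            # stand-in conv5 feature map

# im2col-style 3x3 windows; padding 1 keeps one window per conv5 position.
windows = nn.functional.unfold(conv5, kernel_size=3, padding=1)   # (1, 9C, H*W)
windows = windows.transpose(1, 2).reshape(H, W, 3 * 3 * C)        # rows as sequences

out, _ = blstm(windows)                    # (H, W, 256) recurrent feature per window
print(out.shape)                           # torch.Size([20, 30, 256])
```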
38 | BLSTM (2) | [!≈ bi: el es ti: em] |
- The sequential windows in each row are recurrently connected by a Bi-directional LSTM (BLSTM) [7], where the convolutional feature (3×3×C) of each window is used as input of the 256D BLSTM (including two 128D LSTMs).每行的序列窗口通过双向LSTM(BLSTM)[7]循环连接,其中每个窗口的卷积特征(3×3×C)被用作256维的BLSTM(包括两个128维的LSTM)的输入。
- The sequential windows in each row are recurrently connected by a Bi-directional LSTM (BLSTM) [7], where the convolutional feature (3×3×C) of each window is used as input of the 256D BLSTM (including two 128D LSTMs).每行的序列窗口通过双向LSTM(BLSTM)[7]循环连接,其中每个窗口的卷积特征(3×3×C)被用作256维的BLSTM(包括两个128维的LSTM)的输入。
|
39 | multi-lingual (2) | [!≈ 'mʌlti ˈlɪŋgwəl] |
- Our method is able to handle multi-scale and multi-lingual text in a single process, avoiding further post filtering or refinement.我们的方法能够在单个过程中处理多尺度和多语言的文本,避免进一步的后过滤或细化。
- The side-refinement further improves the localization accuracy, leading to about $2\%$ performance improvements on the SWT and Multi-Lingual datasets.边缘细化进一步提高了定位精度,从而使SWT和Multi-Lingual数据集上的性能提高了约2%。
|
40 | cc (2) | [ˌsi: ˈsi:] |
- They can be roughly grouped into two categories, connected-components (CCs) based approaches and sliding-window based methods.它们可以粗略地分为两类,基于连接组件(CC)的方法和基于滑动窗口的方法。
- The CCs based approaches discriminate text and non-text pixels by using a fast filter, and then text pixels are greedily grouped into stroke or character candidates, by using low-level properties, e.g., intensity, color, gradient, etc. [33,14,32,13,3].基于CC的方法通过使用快速滤波器来区分文本和非文本像素,然后通过使用低级属性(例如强度,颜色,梯度等[33,14,32,13,3])将文本像素贪婪地分为笔划或候选字符。
|
41 | arbitrary (2) | [ˈɑ:bɪtrəri] |
- Similar to Region Proposal Network (RPN) [25], the CTPN is essentially a fully convolutional network that allows an input image of arbitrary size.类似于区域提议网络(RPN)[25],CTPN本质上是一个全卷积网络,允许任意大小的输入图像。
- This dataset is more challenging than previous ones by including arbitrary orientation, very small-scale and low resolution text.这个数据集比以前的数据集更具挑战性,包括任意方向,非常小的尺度和低分辨率的文本。
|
42 | x-coordinate (2) | ['ekskəʊ'ɔ:dnɪt] |
- For each prediction, the horizontal location (x-coordinates) and k-anchor locations are fixed, which can be pre-computed by mapping the spatial window location in the conv5 onto the input image.对于每个预测,水平位置(x轴坐标)和k个锚点位置是固定的,可以通过将conv5中的空间窗口位置映射到输入图像上来预先计算。
- where $x_{side}$ is the predicted x-coordinate of the nearest horizontal side (e.g., left or right side) to the current anchor.其中,$x_{side}$是距当前锚点最近的水平边(例如左边或右边)的预测x坐标。
|
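A small sketch of the pre-computation mentioned in the first sentence, mapping a conv5 column index to anchor x-coordinates on the input image. The helper itself and the half-stride centring (+8) are illustrative assumptions, not quoted from the paper:

```python
STRIDE = 16   # total stride of conv5, fixed by the VGG16 architecture
WIDTH = 16    # fixed width of the fine-scale proposals / anchors

def anchor_x(col):
    """Hypothetical helper: conv5 column index -> anchor x-center and
    left/right sides on the input image (half-stride centring assumed)."""
    c_x = col * STRIDE + STRIDE // 2
    return c_x, c_x - WIDTH // 2, c_x + WIDTH // 2

print(anchor_x(0))   # -> (8, 0, 16)
print(anchor_x(5))   # -> (88, 80, 96)
```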
43 | localizations (2) | [!≈ ˌləʊkəlaɪ'zeɪʃnz] |
- This further reduces the computation and, at the same time, predicts accurate localizations of the text lines.这进一步减少了计算量,同时预测了文本行的准确位置。
- The fine-scale detection and RNN connection are able to predict accurate localizations in vertical direction.细粒度的检测和RNN连接可以预测垂直方向的精确位置。
|
44 | outliers (2) | [aʊt'laɪəz] |
- This may lead to a number of false detections on non-text objects which have a similar structure as text patterns, such as windows, bricks, leaves, etc. (referred to as text-like outliers in [13]).这可能会导致对与文本模式类似的非文本目标的误检,如窗口,砖块,树叶等(在文献[13]中称为类文本异常值)。
- As shown in Fig. 3, the context information is greatly helpful to reduce false detections, such as text-like outliers.如图3所示,上下文信息对于减少误检非常有用,例如类似文本的异常值。
|
45 | side-anchor (2) | [!≈ saɪd ˈæŋkə(r)] |
- To address this problem, we propose a side-refinement approach that accurately estimates the offset for each anchor/proposal in both left and right horizontal sides (referred to as side-anchor or side-proposal).为了解决这个问题,我们提出了一种边缘细化的方法,可以精确地估计左右两侧水平方向上的每个锚点/提议的偏移量(称为边缘锚点或边缘提议)。
- k is the index of a side-anchor, which is defined as a set of anchors within a horizontal distance (e.g., 32-pixel) to the left or right side of a ground truth text line bounding box.k是边缘锚点的索引,边缘锚点被定义为在实际文本行边界框的左侧或右侧水平距离(例如32像素)内的一组锚点。
|
46 | organizer (2) | ['ɔ:ɡənaɪzə(r)] |
- We follow previous work by using standard evaluation protocols which are provided by the dataset creators or competition organizers.我们遵循以前的工作,使用由数据集创建者或竞赛组织者提供的标准评估协议。
- For the ICDAR 2015, we used the online evaluation system provided by the organizers as in [18].对于ICDAR 2015,我们使用了由组织者提供的在线评估系统[18]。
|
47 | remarkably (2) | [rɪ'mɑ:kəblɪ] |
- Obviously, the proposed fine-scale text proposal network (FTPN) improves the Faster R-CNN remarkably in both precision and recall, suggesting that the FTPN is more accurate and reliable, by predicting a sequence of fine-scale text proposals rather than a whole text line.显然,所提出的细粒度文本提议网络(FTPN)在精确度和召回率方面都显著改进了Faster R-CNN,表明通过预测一系列细粒度文本提议而不是整体文本行,FTPN更精确可靠。
- On the ICDAR 2013, it outperforms recent TextFlow [28] and FASText [1] remarkably by improving the F-measure from 0.80 to 0.88.在ICDAR 2013上,它的性能优于最近的TextFlow[28]和FASText[1],将F-measure从0.80提高到了0.88。
|
48 | TextFlow (2) | |
- Our detector performs favourably against the TextFlow on the Multilingual, suggesting that our method generalizes well to various languages.我们的检测器在Multilingual上比TextFlow表现更好,表明我们的方法能很好地泛化到各种语言。
- On the ICDAR 2013, it outperforms recent TextFlow [28] and FASText [1] remarkably by improving the F-measure from 0.80 to 0.88.在ICDAR 2013上,它的性能优于最近的TextFlow[28]和FASText[1],将F-measure从0.80提高到了0.88。
|
49 | incorporate (1) | [ɪnˈkɔ:pəreɪt] |
- The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model.序列提议通过循环神经网络自然地连接起来,该网络无缝地结合到卷积网络中,从而形成端到端的可训练模型。
|
50 | surpass (1) | [səˈpɑ:s] |
- It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpassing recent results [8,35] by a large margin.它在ICDAR 2013和2015的基准数据集上达到了0.88和0.61的F-measure,大大超过了最近的结果[8,35]。
|
51 | retrieval (1) | [rɪˈtri:vl] |
- This is due to its numerous practical applications such as image OCR, multi-language translation, image retrieval, etc. It includes two sub-tasks: text detection and recognition.这是由于它的许多实际应用,如图像OCR,多语言翻译,图像检索等。它包括两个子任务:文本检测和识别。
|
52 | variance (1) | [ˈveəriəns] |
- Large variance of text patterns and highly cluttered backgrounds pose the main challenges for accurate text localization.文本模式的巨大变化和高度杂乱的背景构成了精确文本定位的主要挑战。
|
53 | clutter (1) | [ˈklʌtə(r)] |
- Large variance of text patterns and highly cluttered backgrounds pose the main challenges for accurate text localization.文本模式的巨大变化和高度杂乱的背景构成了精确文本定位的主要挑战。
|
54 | verification (1) | [ˌverɪfɪ'keɪʃn] |
- They commonly start from low-level character or stroke detection, which is typically followed by a number of subsequent steps: non-text component filtering, text line construction and text line verification.它们通常从低级别字符或笔画检测开始,后面通常会跟随一些后续步骤:非文本组件过滤,文本行构建和文本行验证。
|
55 | robustness (1) | [rəʊ'bʌstnəs] |
- These multi-step bottom-up approaches are generally complicated, with limited robustness and reliability.这些自底向上的多步骤方法通常很复杂,鲁棒性和可靠性有限。
|
56 | MSER (1) | [!≈ em es i: ɑ:(r)] |
- These methods commonly explore low-level features (e.g., based on SWT [3,13], MSER [14,33,23], or HoG [28]) to distinguish text candidates from background.这些方法通常探索低级特征(例如,基于SWT[3,13],MSER[14,33,23]或HoG[28])来区分候选文本和背景。
|
57 | HoG (1) | [hɒg] |
- These methods commonly explore low-level features (e.g., based on SWT [3,13], MSER [14,33,23], or HoG [28]) to distinguish text candidates from background.这些方法通常探索低级特征(例如,基于SWT[3,13],MSER[14,33,23]或HoG[28])来区分候选文本和背景。
|
58 | Region-CNN (1) | |
- The state-of-the-art method is Faster Region-CNN (R-CNN) system [25] where a Region Proposal Network (RPN) is proposed to generate high-quality class-agnostic object proposals directly from convolutional feature maps.最先进的方法是Faster Region-CNN(R-CNN)系统[25],其中提出了区域提议网络(RPN)直接从卷积特征映射中生成高质量类别不可知的目标提议。
|
59 | PASCAL (1) | ['pæskәl] |
- For object detection, a typical correct detection is defined loosely, e.g., by an overlap of > 0.5 between the detected bounding box and its ground truth (e.g., the PASCAL standard [4]), since people can recognize an object easily from major part of it.对于目标检测,典型的正确检测是松散定义的,例如,检测到的边界框与其实际边界框(例如,PASCAL标准[4])之间的重叠>0.5,因为人们可以容易地从目标的主要部分识别它。
|
60 | comprehensively (1) | [ˌkɒmprɪˈhensɪvli] |
- By contrast, reading text comprehensively is a fine-grained recognition task which requires a correct detection that covers a full region of a text line or word.相比之下,综合阅读文本是一个细粒度的识别任务,需要正确的检测,覆盖文本行或字的整个区域。
|
61 | fine-grained (1) | [faɪn'greɪnd] |
- By contrast, reading text comprehensively is a fine-grained recognition task which requires a correct detection that covers a full region of a text line or word.相比之下,综合阅读文本是一个细粒度的识别任务,需要正确的检测,覆盖文本行或字的整个区域。
|
62 | leverage (1) | [ˈli:vərɪdʒ] |
- We leverage the advantages of strong deep convolutional features and sharing computation mechanism, and propose the CTPN architecture which is described in Fig. 1.我们利用强深度卷积特性和共享计算机制的优点,提出了如图1所示的CTPN架构。
|
63 | greedily (1) | ['gri:dɪlɪ] |
- The CCs based approaches discriminate text and non-text pixels by using a fast filter, and then text pixels are greedily grouped into stroke or character candidates, by using low-level properties, e.g., intensity, color, gradient, etc. [33,14,32,13,3].基于CC的方法通过使用快速滤波器来区分文本和非文本像素,然后通过使用低级属性(例如强度,颜色,梯度等[33,14,32,13,3])将文本像素贪婪地分为笔划或候选字符。
|
64 | robustly (1) | [rəʊ'bʌstlɪ] |
- Furthermore, robustly filtering out non-character components or confidently verifying detected text lines are themselves difficult [1,33,14].此外,鲁棒地过滤非字符组件或可靠地验证检测到的文本行本身就很困难[1,33,14]。
|
65 | inexpensive (1) | [ˌɪnɪkˈspensɪv] |
- A common strategy is to generate a number of object proposals by employing inexpensive low-level features, and then a strong CNN classifier is applied to further classify and refine the generated proposals.一个常见的策略是通过使用廉价的低级特征来生成许多目标提议,然后使用强CNN分类器来进一步对生成的提议进行分类和细化。
|
66 | Selective (1) | [sɪˈlektɪv] |
- Selective Search (SS) [4], which generates class-agnostic object proposals, is one of the most popular methods applied in recent leading object detection systems, such as Region CNN (R-CNN) [6] and its extensions [5].生成类别不可知目标提议的选择性搜索(SS)[4]是目前领先的目标检测系统中应用最广泛的方法之一,如区域CNN(R-CNN)[6]及其扩展[5]。
|
67 | SS (1) | [!≈ es es] |
- Selective Search (SS) [4], which generates class-agnostic object proposals, is one of the most popular methods applied in recent leading object detection systems, such as Region CNN (R-CNN) [6] and its extensions [5].生成类别不可知目标提议的选择性搜索(SS)[4]是目前领先的目标检测系统中应用最广泛的方法之一,如区域CNN(R-CNN)[6]及其扩展[5]。
|
68 | discriminative (1) | [dɪs'krɪmɪnətɪv] |
- However, the RPN proposals are not discriminative, and require a further refinement and classification by an additional costly CNN model, e.g., the Fast R-CNN model [5].然而,RPN提议不具有判别性,需要通过额外的成本高昂的CNN模型(如Fast R-CNN模型[5])进一步细化和分类。
|
69 | domain-specific (1) | [!≈ dəˈmeɪn spəˈsɪfɪk] |
- More importantly, text is different significantly from general objects, making it difficult to directly apply general object detection system to this highly domain-specific task.更重要的是,文本与一般目标有很大的不同,因此很难直接将通用目标检测系统应用到这个高度领域化的任务中。
|
70 | applicable (1) | [əˈplɪkəbl] |
- We take the very deep 16-layer vggNet (VGG16) [27] as an example to describe our approach, which is readily applicable to other deep models.我们以非常深的16层vggNet(VGG16)[27]为例来描述我们的方法,该方法很容易应用于其他深度模型。
|
71 | k-anchor (1) | [!≈ keɪ ˈæŋkə(r)] |
- For each prediction, the horizontal location (x-coordinates) and k-anchor locations are fixed, which can be pre-computed by mapping the spatial window location in the conv5 onto the input image.对于每个预测,水平位置(x轴坐标)和k个锚点位置是固定的,可以通过将conv5中的空间窗口位置映射到输入图像上来预先计算。
|
72 | suppression (1) | [səˈpreʃn] |
- The detected text proposals are generated from the anchors having a text/non-text score of >0.7 (with non-maximum suppression).检测到的文本提议是从具有> 0.7(具有非极大值抑制)的文本/非文本分数的锚点生成的。
|
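A minimal sketch of the proposal filtering quoted above: keep anchors with a text/non-text score > 0.7, then apply greedy non-maximum suppression. Only the 0.7 score threshold comes from the sentence; the NMS overlap threshold (0.3) is an assumed value:

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_proposals(boxes, scores, score_thr=0.7, nms_thr=0.3):
    """Keep boxes scoring > score_thr, then greedy NMS (nms_thr is assumed)."""
    cand = sorted(((s, b) for s, b in zip(scores, boxes) if s > score_thr),
                  key=lambda t: t[0])
    keep = []
    while cand:
        _, best = cand.pop()                           # highest-scoring candidate
        keep.append(best)
        cand = [(s, b) for s, b in cand if iou(best, b) <= nms_thr]
    return keep
```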
73 | seamless (1) | [ˈsi:mləs] |
- Furthermore, we aim to encode this information directly in the convolutional layer, resulting in an elegant and seamless in-network connection of the fine-scale text proposals.此外,我们的目标是直接在卷积层中编码这些信息,从而实现细粒度文本提议优雅无缝的网内连接。
|
74 | multiplicative (1) | ['mʌltɪplɪkeɪtɪv] |
- The LSTM was proposed specially to address the vanishing gradient problem, by introducing three additional multiplicative gates: the input gate, forget gate and output gate.LSTM通过引入三个额外的乘法门(输入门、遗忘门和输出门),专门用于解决梯度消失问题。
|
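For reference, the three multiplicative gates named above in the standard textbook LSTM formulation (not quoted from the paper); $i_t$, $f_t$ and $o_t$ are the input, forget and output gates, $\sigma$ is the sigmoid, and $\odot$ denotes elementwise multiplication:

$$\begin{aligned} i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i)\\ f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f)\\ o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o)\\ c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)\\ h_t &= o_t \odot \tanh(c_t) \end{aligned}$$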
75 | inaccuracy (1) | [ɪn'ækjərəsɪ] |
- This inaccuracy may not be crucial in generic object detection, but should not be ignored in text detection, particularly for those small-scale text lines or words.这种不准确性在通用目标检测中可能并不重要,但在文本检测中不应忽视,特别是对于那些小型文本行或文字。
|
76 | dash (1) | [dæʃ] |
- Fig. 4: CTPN detection with (red box) and without (yellow dashed box) the side-refinement.图4:CTPN检测有(红色框)和没有(黄色虚线框)边缘细化。
|
77 | Intersection-over-Union (1) | [!≈ ˌɪntəˈsekʃn ˈəʊvə(r) ˈju:niən] |
- A valid anchor is a defined positive anchor ($\textbf{s}_j^*=1$, described below), or has an Intersection-over-Union (IoU) >0.5 overlap with a ground truth text proposal.有效的锚点是定义的正锚点($\textbf{s}_j^*=1$,如下所述),或者与实际文本提议重叠的交并比(IoU)>0.5。
|
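For reference, the textbook definition behind the IoU criterion quoted above (not taken from the paper):

$$\mathrm{IoU}(A, B) = \frac{\operatorname{area}(A \cap B)}{\operatorname{area}(A \cup B)}$$

so a valid anchor is either a defined positive anchor ($\textbf{s}_j^*=1$) or one whose IoU with some ground-truth text proposal exceeds 0.5.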
78 | empirically (1) | [ɪm'pɪrɪklɪ] |
- $\lambda_1$ and $\lambda_2$ are loss weights to balance different tasks, which are empirically set to 1.0 and 2.0. $N_{s}$, $N_{v}$ and $N_{o}$ are normalization parameters, denoting the total number of anchors used by $L^{cl}_s$, $L^{re}_v$ and $L^{re}_o$, respectively.$\lambda_1$和$\lambda_2$是用于平衡不同任务的损失权重,经验性地设置为1.0和2.0。$N_{s}$、$N_{v}$和$N_{o}$是归一化参数,分别表示$L^{cl}_s$、$L^{re}_v$和$L^{re}_o$所使用的锚点总数。
|
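Combining the symbols of this entry, the overall multi-task loss takes the following form; this is a reconstruction consistent with the quoted definitions, not a verbatim equation from this list:

$$L(\textbf{s}_i, \textbf{v}_j, \textbf{o}_k) = \frac{1}{N_s}\sum_i L^{cl}_s(\textbf{s}_i, \textbf{s}_i^*) + \frac{\lambda_1}{N_v}\sum_j L^{re}_v(\textbf{v}_j, \textbf{v}_j^*) + \frac{\lambda_2}{N_o}\sum_k L^{re}_o(\textbf{o}_k, \textbf{o}_k^*)$$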
79 | stochastic (1) | [stə'kæstɪk] |
- The CTPN can be trained end-to-end by using the standard back-propagation and stochastic gradient descent (SGD).通过使用标准的反向传播和随机梯度下降(SGD),可以对CTPN进行端对端训练。
|
80 | descent (1) | [dɪˈsent] |
- The CTPN can be trained end-to-end by using the standard back-propagation and stochastic gradient descent (SGD).通过使用标准的反向传播和随机梯度下降(SGD),可以对CTPN进行端对端训练。
|
81 | SGD (1) | ['esdʒ'i:d'i:] |
- The CTPN can be trained end-to-end by using the standard back-propagation and stochastic gradient descent (SGD).通过使用标准的反向传播和随机梯度下降(SGD),可以对CTPN进行端对端训练。
|
82 | resize (1) | [ˌri:ˈsaɪz] |
- The input image is resized by setting its short side to 600 for training, while keeping its original aspect ratio.为了训练,通过将短边设置为600来调整输入图像的大小,同时保持其原始长宽比。
|
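A minimal sketch of the resizing rule above; OpenCV is used purely for illustration (the library choice is an assumption):

```python
import cv2

def resize_short_side(image, short=600):
    """Resize so the shorter side equals `short`, keeping the aspect ratio."""
    h, w = image.shape[:2]
    scale = short / min(h, w)
    return cv2.resize(image, (round(w * scale), round(h * scale)))
```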
83 | Gaussian (1) | ['gaʊsɪən] |
- We initialize the new layers (e.g., the RNN and output layers) by using random weights with Gaussian distribution of 0 mean and 0.01 standard deviation.我们通过使用具有0均值和0.01标准差的高斯分布的随机权重来初始化新层(例如,RNN和输出层)。
|
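That initialization is a one-liner in a modern framework; a PyTorch illustration (the paper used Caffe, and the bias initialization here is an assumption):

```python
import torch.nn as nn

layer = nn.Linear(256, 20)                          # stand-in for a new output layer
nn.init.normal_(layer.weight, mean=0.0, std=0.01)   # 0-mean, 0.01-std Gaussian
nn.init.zeros_(layer.bias)                          # bias init: an assumption
```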
84 | deviation (1) | [ˌdi:viˈeɪʃn] |
- We initialize the new layers (e.g., the RNN and output layers) by using random weights with Gaussian distribution of 0 mean and 0.01 standard deviation.我们通过使用具有0均值和0.01标准差的高斯分布的随机权重来初始化新层(例如,RNN和输出层)。
|
85 | momentum (1) | [məˈmentəm] |
- We used 0.9 momentum and 0.0005 weight decay.我们使用0.9的动量和0.0005的权重衰减。
|
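The momentum and weight decay above map directly onto a standard SGD configuration; a PyTorch illustration (the learning rate and the stand-in model are assumptions, not quoted in this entry):

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 2)                            # stand-in model
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.001,                # learning rate is assumed
                            momentum=0.9,            # from the entry above
                            weight_decay=0.0005)     # from the entry above
```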
86 | Caffe (1) | |
- Our model was implemented in Caffe framework [17].我们的模型在Caffe框架[17]中实现。
|
87 | Incidental (1) | [ˌɪnsɪˈdentl] |
- The ICDAR 2015 (Incidental Scene Text - Challenge 4) [18] includes 1,500 images which were collected by using the Google Glass.ICDAR 2015(Incidental Scene Text - Challenge 4)[18]包括使用Google Glass收集的1500张图像。
|
88 | orientation (1) | [ˌɔ:riənˈteɪʃn] |
- This dataset is more challenging than previous ones by including arbitrary orientation, very small-scale and low resolution text.这个数据集比以前的数据集更具挑战性,包括任意方向,非常小的尺度和低分辨率的文本。
|
89 | Epshtein (1) | |
- Epshtein et al. [3] introduced the SWT dataset containing 307 images which include many extremely small-scale texts.Epshtein等[3]引入了包含307张图像的SWT数据集,其中包含许多极小尺度的文本。
|
90 | marginally (1) | [ˈmɑ:dʒɪnəli] |
- Therefore, the proposed in-network recurrent mechanism increases model computation only marginally, while obtaining a considerable performance gain.因此,所提出的网内循环机制仅略微增加了模型计算量,同时获得了相当大的性能增益。
|
91 | submission (1) | [səbˈmɪʃn] |
- In addition, we further compare our method against [8,11,35], which were published after our initial submission.此外,我们进一步与[8,11,35]比较了我们的方法,它们是在我们的首次提交后发布的。
|
92 | consistently (1) | [kən'sɪstəntlɪ] |
- It consistently obtains substantial improvements on F-measure and recall.它始终在F-measure和召回率方面取得重大进展。
|
93 | capability (1) | [ˌkeɪpəˈbɪləti] |
- This may be due to the strong capability of the CTPN for detecting extremely challenging text, e.g., very small-scale ones, some of which are even difficult for humans.这可能是由于CTPN在检测极具挑战性的文本(例如非常小的文本)方面能力很强,其中一些甚至对人来说都很难。
|
94 | competitively (1) | [!≈ kəmˈpetətɪvli] |
- By using the scale of 450, the running time is reduced to 0.09s/image, while obtaining P/R/F of 0.92/0.77/0.84 on the ICDAR 2013, which compares competitively against Gupta et al.'s approach [8] running at 0.07s/image on a GPU.使用450的缩放比例时,运行时间降低到0.09s每张图像,同时在ICDAR 2013上获得0.92/0.77/0.84的P/R/F,与Gupta等人使用GPU达到0.07s每张图像的方法[8]相比具有竞争力。
|
95 | Gupta (1) | |
- By using the scale of 450, the running time is reduced to 0.09s/image, while obtaining P/R/F of 0.92/0.77/0.84 on the ICDAR 2013, which compares competitively against Gupta et al.'s approach [8] running at 0.07s/image on a GPU.使用450的缩放比例时,运行时间降低到0.09s每张图像,同时在ICDAR 2013上获得0.92/0.77/0.84的P/R/F,与Gupta等人使用GPU达到0.07s每张图像的方法[8]相比具有竞争力。
|